Our goal is to conduct a case study to demonstrate the potential value of Geospatial Data Science & Analysis, by integrating publicly available data from multiple sources to build a spatial interaction models, to determine factors affecting urban mobility patterns of public bus transit.
Recentdeployments of massively pervasive computing technologies such as in-vehicle GPS and SMART cards by public transport provide plenty of data for tracking movement patterns through space and time, but the explosive growth of data has outstripped public services’ ability to utilise, transform, and understand the data.
Modifible Areal Unit Problem (MAUP): - MPSZ is too coarse, too huge a subzone area; people may live only in a corner of the subzone - Planning subzones too large, so we use analytical hexagon
“Map is not interesting, pattern revealed by and factors affecting the map is interesting. Can we explain this by building a spatial model?” - Building a model to explain flows; Spatial Interaction Model
GLM: Generalised Linear-Regression Model (over linear model):
Dwelling Units as proxy for population;
Can compare HDB-only VS dwelling units vs room-flat vs 1/2/3/ room flat unit
POI: points of interest name & type
Important Note
Due to the nature of EDA and Data Analysis, parts of this page have been Collapsed or placed behind tabs, to avoid excessive scrolling.
For easier reading, this study is also presented in point-form.
0.1 Dataset used
0.1.1 Aspatial Dataset
Dataset, Purpose & Source:
Key columns
hdb.csv : HDB Property information data with geocoding
Skipping install of 'patchwork' from a github remote, the SHA1 (51a6eff5) has not changed since last install.
Use `force = TRUE` to force installation
library(patchwork)
1.2 Import Geospatial Data
1.2.1 raw_bus_stop_sf: Load Geospatial Bus Stop Data
First, we load BusStop shapefile data from LTA Datamall;
st_read() is used to import the ESRI Shapefile data into an sf dataframe.
From previous take-home exercise, we know BusStop has the incorrect CRS (coordinate reference system), as EPSG 9001 instead of 3414, so we use st_set_crs() to correct this
Warning: The shape mpsz_sf is invalid. See sf::st_is_valid
We note there are a number of bus stops outside Singapore’s boundaries; Specifically, three bus stops in a cluster just outside the Causeway, and one further North.
We perform several steps to isolate & check the data;
we use st_filter() to find bus stops within Singapore’s Administrative National Boundaries, and create sg_bus_stop_sf for future use.
to check what bus stops have been dropped, we use anti_join() to compare raw_bus_stop_sf with sg_bus_stop_sf. We use st_drop_geometry on both sf dataframes to only compare the non-geometry columns.
Joining with `by = join_by(BUS_STOP_N, BUS_ROOF_N, LOC_DESC)`
BUS_STOP_N BUS_ROOF_N LOC_DESC
1 47701 NIL JB SENTRAL
2 46239 NA LARKIN TER
3 46609 NA KOTARAYA II TER
4 46211 NIL JOHOR BAHRU CHECKPT
5 46219 NIL JOHOR BAHRU CHECKPT
We see there are in fact 5 bus stops outside of Singapore (including the defunct Kotaraya II Terminal) that have been removed, which means our data cleaning was correct.
1.3 Geospatial Data Cleaning
1.3.1 Removing Duplicate Bus Stops
From Take-home Exercise 1, we know that there are a number of repeated bus stops. We repeat some steps;
We use length() to find the total number of raw values in the $BUS_STOP_N column of sg_bus_stop_sf;
We then compare this to length(unique()) to find the unique values;
And, indeed, we find there are 16 bus stops that are repeated;
cat("\nResults before removing duplicates: \n=======================================================\n")
Results before removing duplicates:
=======================================================
cat("Total number of rows in sg_bus_stop_sf\t\t: ", paste0(length(sg_bus_stop_sf$BUS_STOP_N)))
Total number of rows in sg_bus_stop_sf : 5156
cat("\nTotal unique bus stop IDs in sg_bus_stop_sf\t: ", paste0(length(unique(sg_bus_stop_sf$BUS_STOP_N))))
Total unique bus stop IDs in sg_bus_stop_sf : 5140
cat("\nRepeated bus stops\t\t\t\t: ", paste0(length(sg_bus_stop_sf$BUS_STOP_N) -length(unique(sg_bus_stop_sf$BUS_STOP_N))))
Repeated bus stops : 16
It appears there are 16 datapoints that are specifically repeated; let’s remove them by deleting the duplicated rows:
we use duplicated() to identify rows with repeated values of $BUS_STOP_N; duplicated rows will return TRUE while all other rows will return FALSE
We use ! to invert the values, so only the unduplicated rows will return TRUE.
We then use square brackets [] to index sg_bus_stop_sf based on the rows, and return only the unduplicated rows;
We then assign the output using <- into bus_stop_sf, for use.
Simple feature collection with 6 features and 3 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 13228.59 ymin: 30391.85 xmax: 41603.76 ymax: 44206.38
Projected CRS: SVY21 / Singapore TM
BUS_STOP_N BUS_ROOF_N LOC_DESC geometry
1 22069 B06 OPP CEVA LOGISTICS POINT (13576.31 32883.65)
2 32071 B23 AFT TRACK 13 POINT (13228.59 44206.38)
3 44331 B01 BLK 239 POINT (21045.1 40242.08)
4 96081 B05 GRACE INDEPENDENT CH POINT (41603.76 35413.11)
5 11561 B05 BLK 166 POINT (24568.74 30391.85)
6 66191 B03 AFT CORFE PL POINT (30951.58 38079.61)
show code
cat("\nResults after removing duplicates: \n=======================================================\n")
Results after removing duplicates:
=======================================================
show code
cat("Total number of rows in bus_stop_sf\t\t: ", paste0(length(bus_stop_sf$BUS_STOP_N)))
Total number of rows in bus_stop_sf : 5140
show code
cat("\nTotal unique bus stop IDs in bus_stop_sf\t: ", paste0(length(unique(bus_stop_sf$BUS_STOP_N))))
Total unique bus stop IDs in bus_stop_sf : 5140
show code
cat("\nRepeated bus stops\t\t\t\t: ", paste0(length(bus_stop_sf$BUS_STOP_N) -length(unique(bus_stop_sf$BUS_STOP_N))))
Repeated bus stops : 0
We can do a quick check to visualize these;
We use tmap_mode("view") to allow us to scroll around and check that the bus stops fall within Singapore’s national boundaries, and set zoom limits to focus the attention
We combine tm_shape() and tm_polygons() to map the master plan subzones in grey;
We combine tm_shape() and tm_dots() to map locations of bus stops; for visual distinction with the grey zones, we use the “Spectral” palette
Warning: The shape mpsz_sf is invalid (after reprojection). See sf::st_is_valid
Warning: Number of levels of the variable "BUS_STOP_N" is 5140, which is larger
than max.categories (which is 30), so levels are combined. Set
tmap_options(max.categories = 5140) in the layer function to show all levels.
show code
tmap_mode("plot")
tmap mode set to plotting
Now, we can start preparing the hexagon map.
1.4 Generating Hexagon Maps
We generate the hexagon map in three steps:
We use st_make_grid() with square = FALSE to create the hexagon layer as defined in the study, which we name raw_hex_grid;
We pass cellsize = 750 to create the hexagons of necessary size. Prof Kam defined the apothem as 375m, as the Traffic Analysis Zone is typically 750m in size.
I used units::as_units to pass 750 metres into the argument. I am still uncertain whether a length of 750m needs to be reprojected, or whether we need to do any further transformation.
We use st_transform() just in case we need to reproject the coordinate system, just in case.
We use st_sf() to convert raw_hex_grid into an sf dataframe for further manipulation later; - However, trying to visualize this right now just gives us a map full of hexagons, so need to eliminate the empty hexagons; - We use mutate() to create a new column, $n_bus_stops that counts the number of bus stops in each hexagon using lengths(st_intersects())
st_intersects() gives us a list of bus stops in each hexagon, so we use lengths() to count the number
We create our final sf dataframe, hexagon_sf in two steps; - First, we use filter() to select only hexagons with nonzero number of bus stops; - Then, mutate() is used here to create a grid_id column, labelling only the hexagons with nonzero bus stops.
Finally, we perform a quick plot to confirm that every bus stop is inside a hexagon;
Using hexagon_sf as our base layer, we use tmap_shape() and tm_polygons() to visualizes our hexes. We pass palette = "Spectral" for visual distinguishment against the black dots of the bus stops.
We use tmap_shape() and tm_dots() to visualize our bus stop as black dots
show code
# STEP 1 - Create hexagon map raw_hex_grid <-st_make_grid(bus_stop_sf, cellsize = units::as_units(375, "m"), what ="polygons", square =FALSE) %>%st_transform(crs =3414)# STEP 2 - Convert to sf object and count the number of bus stops inside;raw_hex_grid <-st_sf(raw_hex_grid) %>%mutate(n_bus_stops =lengths(st_intersects(raw_hex_grid, sg_bus_stop_sf))) # Count number of points in each grid, code snippet referenced from: # https://gis.stackexchange.com/questions/323698/counting-points-in-polygons-with-sf-package-of-r# STEP 3 - Filter for nonempty hexes and label:hexagon_sf <-filter(raw_hex_grid, n_bus_stops >0) hexagon_sf <- hexagon_sf %>%mutate(grid_id =1:nrow(hexagon_sf)) %>%select(grid_id, raw_hex_grid, n_bus_stops)tmap_mode("plot")
We can visually confirm that every black dot is within a hexagon.
We re-use a bit of code from Take-Home Exercise 1 to plot our analytical hexagons on a map of Singapore and visualize the number of bus stops in each hexagon:
Rows: 5694297 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): YEAR_MONTH, DAY_TYPE, PT_TYPE, ORIGIN_PT_CODE, DESTINATION_PT_CODE
dbl (2): TIME_PER_HOUR, TOTAL_TRIPS
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
This is a huge tibble dataframe with over 5 million rows, so we will filter this now by peaks;
For this study, we focus on Weekday afternoon peak
1.5.2 Filter For Peaks – Weekday Afternoon
We now perform a multi-step filter process;
We combine mutate() with case_when() to create a new column, $PEAK, based on the study criteria:
We set the value to “WEEKDAY_AFTERNOON_PEAK” if $DAY_TYPE is “WEEKDAY” and bus tap-on time (e.g. $TIME_PER_HOUR) is between 5 pm and 8pm, inclusive;
Note that we convert the values for $TIME_PER_HOUR to 24-hour clock, e.g. “5pm” is “17” hundred hours, “8pm” is “20” hundred hours.
For all remaining values, we assign an “Unknown” value.
We then use filter() to eliminate those with “Unknown” $PEAK values, i.e. rows outside the period of interest for the study
We use group_by() to group the values by $ORIGIN_PT_CODE and $DESTINATION_PT_CODE, and use summarise() to aggregate the sum of $TOTAL_TRIPS as a new column, $TRIPS.
We use write_rds() to save the output dataframe, odbus_filtered, as an RDS object for future reference/load.
1.6 Combine Bus Trip Data With hexagon_sf Dataframe
For our study purposes, we need to have the number of bus trips originating from each hexagon. In order to achieve this, we must combine our three current dataframes:
hexagon_sf, an sf dataframe with $grid_id column (primary key) and the specific polygon geometry of each hexagon;
bus_stop_sf, an sf dataframe with the $BUS_STOP_N (primary key) and the point geometry of each bus stop;
odbus_filtered, a tibble dataframe with the $ORIGIN_PT_CODE (primary key) column and the trip details for each of the four peak periods of interest.
1.6.1 bus_stop_hexgrid_id: Identify Hexagon grid_id For Each Bus Stop
First, we combine hexagon_sf with sg_bus_stop_sf ;
We use st_intersection to combine the dataframes such that each row of sg_bus_stop_sf has an associated grid_id;
We use select() to filter the resultant bus_stop_hexgrid_id dataframe to only $grid_id and $BUS_STOP_N columns, and use st_drop_geometry() to convert into a simple dataframe with just two columns:
We finally create the final version of our flow dataset.
Just to be safe, we perform a step to remove duplicates using unique()
drop_na() removes bus stops for which we have no info. It turns out there are bus stop numbers outside our bus stop dataset, like 03361, 03549, 03579, 59009;
Some of these may be new bus stops, like 03361, Gardens by the Bay Stn Exit 2, and 03549
We use group_by() to combine rows with the same $ORIGIN_HEX and $DESTIN_HEX, and aggregate the number of trips with summarise()
grid_id_from grid_id_to dist
Min. : 1 Min. : 1 Min. : 375
1st Qu.: 543 1st Qu.: 543 1st Qu.: 7875
Median :1086 Median :1086 Median :12898
Mean :1086 Mean :1086 Mean :13607
3rd Qu.:1629 3rd Qu.:1629 3rd Qu.:18221
Max. :2171 Max. :2171 Max. :44554
Minimum distance 375
We will arbirtraitly insert 50m into interzonal distance
grid_id_from grid_id_to dist
Min. : 1 Min. : 1 Min. : 50
1st Qu.: 543 1st Qu.: 543 1st Qu.: 7875
Median :1086 Median :1086 Median :12898
Mean :1086 Mean :1086 Mean :13600
3rd Qu.:1629 3rd Qu.:1629 3rd Qu.:18221
Max. :2171 Max. :2171 Max. :44554
We also observe that number of trips for Weekday Morning & Weekday Afternoon seems to be larger than Weekend Morning and Weekend Evening peak trips. This is also confirmed by the figure in the next section.
This means that we will have to consider Weekday and Weekend peaks on different scales.
This is an exceptionally ugly plot, but it shows an important point: there is some serious right skew in our dataset;
Clearly, there are some hexagons with exceptionally high trips compared to the rest of the hexagons But, could this be because some hexagons have up to 11 bus stops, while others have 1 or 2?
We do a quick scatter plot based on $n_bus_stops to verify this:
show code
# hexagon_sf %>% # st_drop_geometry() %>% # pivot_longer(cols = starts_with("WEEK"),# names_to = "PEAK", values_to = "TRIPS") %>%# ggplot( aes(x=TRIPS, y=n_bus_stops, color=PEAK, shape=PEAK)) +# geom_point(size=2) +# ggtitle("Scatterplot: Trips over peak periods by number of bus stops per hexagon, 2023-Oct data") +# theme(legend.position=c(.85, .15),# legend.background = element_rect(fill = "transparent"),# legend.key.size = unit(0.5, "cm"), # legend.text = element_text(size = 6),# legend.title = element_text(size = 8)# )
Surprising results from our plot! If we consider those with > 100,000 trips as outliers, most of them come from hexagons with between 4-8 bus stops;
There is some correlation between number of bus stops and high numbers of trips, but a stronger factor is peak time; Weekday Morning peak trips, followed by Weekday Afternoon peak trips, contribute to the largest outliers.
Tip
I note that these visualizations only scrape the surface of understanding the data. However, this is not the focus of our study; we do these quick visualizations only to provide better context for our study.
2. Preparing three propulsive and three attractiveness variables
Reading layer `RapidTransitSystemStation' from data source
`C:\1darren\ISSS624\Take-home_Ex02\data\geospatial' using driver `ESRI Shapefile'
Warning in CPL_read_ogr(dsn, layer, query, as.character(options), quiet, : GDAL
Message 1: Non closed ring detected. To avoid accepting it, set the
OGR_GEOMETRY_ACCEPT_UNCLOSED_RING configuration option to NO
Simple feature collection with 220 features and 4 fields
Geometry type: POLYGON
Dimension: XY
Bounding box: xmin: 6068.209 ymin: 27478.44 xmax: 45377.5 ymax: 47913.58
Projected CRS: SVY21